Regression and correlation
Problem
You want to perform linear regressions and/or correlations.
Solution
Some sample data to work with:
```r
# Make some data
# X increases (noisily)
# Z increases slowly
# Y is constructed so it is inversely related to xvar and positively related to xvar*zvar
set.seed(955)
xvar <- 1:20 + rnorm(20, sd=3)
zvar <- 1:20/4 + rnorm(20, sd=2)
yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd=4)

# Make a data frame with the variables
df <- data.frame(x=xvar, y=yvar, z=zvar)
df
#              x          y           z
#   -4.252354091  4.5857688  1.89877152
#    1.702317971 -4.9027824 -0.82937359
#    4.323053753 -4.3076433 -1.31283495
#    1.780628408  0.2050367 -0.28479448
#   ...
```
Correlation
```r
# Correlation coefficient
cor(df$x, df$y)
# -0.7695378
```
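If you also want a significance test and a confidence interval for the correlation, base R's `cor.test` provides them. A minimal sketch (the sample data from above is re-created so the block runs on its own):

```r
# Re-create the sample data from above so this block is self-contained
set.seed(955)
xvar <- 1:20 + rnorm(20, sd=3)
zvar <- 1:20/4 + rnorm(20, sd=2)
yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd=4)
df <- data.frame(x=xvar, y=yvar, z=zvar)

# Test of the correlation between x and y
ct <- cor.test(df$x, df$y)
ct$estimate   # the correlation coefficient (same value as cor(df$x, df$y))
ct$p.value    # p-value for H0: true correlation is 0
ct$conf.int   # 95% confidence interval for the correlation
```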
Correlation matrices (for multiple variables)
It is also possible to run correlations between many pairs of variables, using a matrix or data frame.
```r
# A correlation matrix of the variables
cor(df)
#            x            y           z
# x  1.0000000 -0.769537849 0.491698938
# y -0.7695378  1.000000000 0.004172295
# z  0.4916989  0.004172295 1.000000000

# Print with only two decimal places
round(cor(df), 2)
#       x     y    z
# x  1.00 -0.77 0.49
# y -0.77  1.00 0.00
# z  0.49  0.00 1.00
```
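`cor` computes the Pearson correlation by default; if the relationships may be monotonic but not linear, the `method` argument switches to rank-based coefficients. A sketch using the same data frame (re-created here so the block is self-contained):

```r
# Re-create the sample data frame from above
set.seed(955)
xvar <- 1:20 + rnorm(20, sd=3)
zvar <- 1:20/4 + rnorm(20, sd=2)
yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd=4)
df <- data.frame(x=xvar, y=yvar, z=zvar)

round(cor(df, method="spearman"), 2)   # Spearman's rho (rank correlation)
round(cor(df, method="kendall"), 2)    # Kendall's tau
```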
To visualize a correlation matrix, see ../../Graphs/Correlation matrix.
Linear regression
Linear regression, with df$x as the predictor and df$y as the outcome. This can be done using two columns from a data frame, or with numeric vectors directly.
```r
# These two commands will have the same outcome:
fit <- lm(y ~ x, data=df)   # Using the columns x and y from the data frame
fit <- lm(df$y ~ df$x)      # Using the vectors df$x and df$y

fit
# Call:
# lm(formula = y ~ x, data = df)
#
# Coefficients:
# (Intercept)            x
#     -0.2278      -1.1829
# This means that the predicted y = -0.2278 - 1.1829*x

# Get more detailed information:
summary(fit)
# Call:
# lm(formula = y ~ x, data = df)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -15.8922  -2.5114   0.2866   4.4646   9.3285
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -0.2278     2.6323  -0.087    0.932
# x            -1.1829     0.2314  -5.113 7.28e-05 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 6.506 on 18 degrees of freedom
# Multiple R-squared: 0.5922,  Adjusted R-squared: 0.5695
# F-statistic: 26.14 on 1 and 18 DF,  p-value: 7.282e-05
```
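Once a model has been fitted, the standard extractor functions pull out the pieces you are likely to need: coefficients, confidence intervals, and predictions at new predictor values. A sketch (the data and model from above are re-created so the block runs on its own):

```r
# Re-create the data and the simple regression from above
set.seed(955)
xvar <- 1:20 + rnorm(20, sd=3)
zvar <- 1:20/4 + rnorm(20, sd=2)
yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd=4)
df <- data.frame(x=xvar, y=yvar, z=zvar)
fit <- lm(y ~ x, data=df)

coef(fit)                                      # intercept and slope
confint(fit)                                   # 95% confidence intervals for the coefficients
predict(fit, newdata=data.frame(x=c(5, 10)))   # predicted y at new x values
```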
To visualize the data with regression lines, see ../../Graphs/Scatterplots (ggplot2) and ../../Graphs/Scatterplot.
Linear regression with multiple predictors
Linear regression with y as the outcome, and x and z as predictors.
Note that the formula specified below does not test for interactions between x and z.
```r
# These have the same result
fit2 <- lm(y ~ x + z, data=df)     # Using the columns x, y, and z from the data frame
fit2 <- lm(df$y ~ df$x + df$z)     # Using the vectors df$x, df$y, and df$z

fit2
# Call:
# lm(formula = y ~ x + z, data = df)
#
# Coefficients:
# (Intercept)            x            z
#      -1.382       -1.564        1.858

summary(fit2)
# Call:
# lm(formula = y ~ x + z, data = df)
#
# Residuals:
#    Min     1Q Median     3Q    Max
# -7.974 -3.187 -1.205  3.847  7.524
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -1.3816     1.9878  -0.695  0.49644
# x            -1.5642     0.1984  -7.883 4.46e-07 ***
# z             1.8578     0.4753   3.908  0.00113 **
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 4.859 on 17 degrees of freedom
# Multiple R-squared: 0.7852,  Adjusted R-squared: 0.7599
# F-statistic: 31.07 on 2 and 17 DF,  p-value: 2.1e-06
```
Interactions
The topic of how to properly do multiple regression and test for interactions can be quite complex and is not covered here; below we simply fit a model with x, z, and the interaction between the two.
To model the interaction between x and z, an x:z term must be added to the formula. Equivalently, the shorthand x*z expands to x + z + x:z.
```r
# These are equivalent; the x*z expands to x + z + x:z
fit3 <- lm(y ~ x * z, data=df)
fit3 <- lm(y ~ x + z + x:z, data=df)

fit3
# Call:
# lm(formula = y ~ x + z + x:z, data = df)
#
# Coefficients:
# (Intercept)            x            z          x:z
#      2.2820      -2.1311      -0.1068       0.2081

summary(fit3)
# Call:
# lm(formula = y ~ x + z + x:z, data = df)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -5.3045 -3.5998  0.3926  2.1376  8.3957
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  2.28204    2.20064   1.037   0.3152
# x           -2.13110    0.27406  -7.776    8e-07 ***
# z           -0.10682    0.84820  -0.126   0.9013
# x:z          0.20814    0.07874   2.643   0.0177 *
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 4.178 on 16 degrees of freedom
# Multiple R-squared: 0.8505,  Adjusted R-squared: 0.8225
# F-statistic: 30.34 on 3 and 16 DF,  p-value: 7.759e-07
```
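One way to check whether adding the interaction term improves the fit is to compare the two nested models with an F test via `anova()`. A sketch (the data and both models from above are re-created so the block runs on its own):

```r
# Re-create the data and both models from above
set.seed(955)
xvar <- 1:20 + rnorm(20, sd=3)
zvar <- 1:20/4 + rnorm(20, sd=2)
yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd=4)
df <- data.frame(x=xvar, y=yvar, z=zvar)

fit2 <- lm(y ~ x + z, data=df)   # main effects only
fit3 <- lm(y ~ x * z, data=df)   # main effects plus interaction

# F test comparing the nested models;
# a small p-value favors the model with the interaction term
anova(fit2, fit3)
```

Because only one parameter is added, the p-value from this F test matches the t test for the x:z coefficient in summary(fit3).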